CS 132 Data Exploration¶

Submitted by Group 17:

  • Alfonso, Francis Donald
  • Dizon, Julia Francesca
  • Paragas, Geri Angela

from CS 132 WFU

Dataset and Module Imports¶

We first import the necessary modules.

In [2]:
!pip install chart_studio
Requirement already satisfied: chart_studio in c:\users\julia\anaconda3\lib\site-packages (1.1.0)
Requirement already satisfied: plotly in c:\users\julia\anaconda3\lib\site-packages (from chart_studio) (5.9.0)
Requirement already satisfied: six in c:\users\julia\anaconda3\lib\site-packages (from chart_studio) (1.16.0)
Requirement already satisfied: retrying>=1.3.3 in c:\users\julia\anaconda3\lib\site-packages (from chart_studio) (1.3.4)
Requirement already satisfied: requests in c:\users\julia\anaconda3\lib\site-packages (from chart_studio) (2.28.1)
Requirement already satisfied: tenacity>=6.2.0 in c:\users\julia\anaconda3\lib\site-packages (from plotly->chart_studio) (8.0.1)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\julia\anaconda3\lib\site-packages (from requests->chart_studio) (2023.5.7)
Requirement already satisfied: idna<4,>=2.5 in c:\users\julia\anaconda3\lib\site-packages (from requests->chart_studio) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\julia\anaconda3\lib\site-packages (from requests->chart_studio) (1.26.14)
Requirement already satisfied: charset-normalizer<3,>=2 in c:\users\julia\anaconda3\lib\site-packages (from requests->chart_studio) (2.0.4)
In [3]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.io as pio
import plotly.offline as po
import plotly.express as px
import plotly.graph_objects as go
import chart_studio

pio.renderers.default = 'notebook'
po.init_notebook_mode()

The researchers collected data in two different phases:

  1. Phase 1: 150 tweets (Combined Dataset)
  2. Phase 2: 126 tweets (Sample of Additional Tweets)

As explained further in a later section, the researchers sampled only 126 additional tweets in order to narrow the scope to the years 2020-2022. This cutoff also excludes 24 tweets from the Phase 1 dataset.

To integrate the two datasets properly, we conduct the data preprocessing and data exploration in separate notebooks, keeping a clear separation between the research phases.

Phase 1 Dataset¶

We import the original dataset and store it in a dataframe named original_dataset. We clone the dataset into a dataframe called original_tweets to be able to manipulate the data non-destructively.

In [4]:
url = "https://raw.githubusercontent.com/jolyuh/cs132-grp17-portfolio/main/assets/dataset/Combined%20Dataset%20-%20Group%2017.xlsx"

original_dataset = pd.read_excel(url)
original_tweets = original_dataset.copy()
original_tweets.shape
Out[4]:
(150, 37)

Phase 2 Dataset¶

We import the additional/supplemental dataset and store it in a dataframe called addtl_dataset. Just like with the original, we clone the dataset into a dataframe called addtl_tweets to be able to manipulate the data non-destructively.

In [5]:
url_2 = "https://raw.githubusercontent.com/jolyuh/cs132-grp17-portfolio/main/assets/dataset/Additional%20Tweets%20(Sample)%20-%20Group%2017.xlsx"

addtl_dataset = pd.read_excel(url_2)
addtl_tweets = addtl_dataset.copy()
addtl_tweets.shape
Out[5]:
(126, 37)

Creating the final dataset¶

The additional data was collected to make statistical tests on the tweets possible. The Phase 1 data consisted solely of red-tagging tweets, while the research hypothesis involves categorizing tweets by whether their authors are Marcos supporters, Duterte supporters, or neither; with red-tagging tweets alone, there was no non-red-tagging group to test against.

The data collected in Phase 2 consists of both red-tagging and non-redtagging tweets manually tagged by the researchers, and so two columns were added:

  1. Red-tagging: True or False
  2. Tone: "Positive", "Neutral", or "Negative"

The Red-tagging column simply separates red-tagging tweets from non-red-tagging ones. The researchers also felt the need to manually tag the Tone of each tweet, because not all non-red-tagging tweets are positive in nature (e.g. "Anakbayan is trash!" is not red-tagging, but still expresses a negative sentiment towards the organization).

To combine the datasets, we first add the two new columns to the original dataset. They are easy to populate, since all of the tweets in the original dataset are red-tagging and negative in tone.

In [6]:
# Add Red-tagging and Tone columns to original dataset

original_tweets['Red-tagging'] = True
original_tweets['Tone'] = "Negative"

original_tweets.head(5)
Out[6]:
un Timestamp Tweet URL Group Collector Category Topic Keywords Account handle Account name ... Rating Reasoning Remarks Marcos supporter Duterte supporter Explanation for the political stance Red-tagging Tone Reviewer Review
0 17-1 2023-03-22 15:36:55 https://twitter.com/DelicadoJuanito/status/130... 17 Alfonso, Francis Donald REDT AnakBayan is a terrorist organization Anakbayan, NPA @DelicadoJuanito Juanito Delicado ... UNPROVEN Twelve months after the murders, no suspect ha... No location.\n\nThis tweet is a reply on a pos... False True Unknown stance on Marcos\n\nTweet about Dutert... True Negative NaN NaN
1 17-2 2023-03-26 20:25:45 https://twitter.com/MDSOnwardPH22/status/12975... 17 Alfonso, Francis Donald REDT AnakBayan is a terrorist organization Anakbayan, NPA @MDSOnwardPH22 ❤💞💞🇵🇭🇵🇭🙏🙏MARCOS-DUTERTE👊🏽👊🏽👊🏽 ... False Labelling Anakbayan as the legal NPA front org... The video is alluding that Anakbayan is part o... True True Account name True Negative NaN NaN
2 17-3 2023-03-26 22:34:19 https://twitter.com/ysaysantos24/status/127895... 17 Alfonso, Francis Donald REDT AnakBayan is a terrorist organization Anakbayan, NPA @ysaysantos24 Relissa Lucena ... False Labelling Anakbayan as the legal NPA front org... No location.\n\nStart of a thread False True Unknown stance on Marcos \n\nLikes tweets supp... True Negative NaN NaN
3 17-4 2023-03-30 02:44:36.674000 https://twitter.com/Dbigbalbowski/status/97052... 17 Dizon, Julia Francesca REDT AnakBayan is a terrorist organization Anakbayan, terorista @Dbigbalbowski Mr James🇵🇭 ... UNPROVEN It has not been proven that Myles Albasin is p... No bio.\n\nThe tweet contains a collage of pic... True True Liked tweets supporting Duterte and Marcos. True Negative NaN NaN
4 17-5 2023-03-30 02:49:23.538000 https://twitter.com/Earth751/status/9700304405... 17 Dizon, Julia Francesca REDT AnakBayan is a terrorist organization Anakbayan, terorista, NPA @Earth751 Earth@75 ... NaN Labelling Anakbayan as the legal NPA front org... Account was last active in 2018. False True Tweets and likes content in support of or rela... True Negative NaN NaN

5 rows × 37 columns

We then concatenate the two datasets into a dataframe called final_dataset and clone that into a dataframe named tweets to be used in the data preprocessing proper.

In [7]:
final_dataset = pd.concat([original_tweets, addtl_tweets], ignore_index=True)

tweets = final_dataset.copy()
tweets.shape
Out[7]:
(276, 38)

Exploring Data¶

Dropping unnecessary columns¶

Since the original dataset contains columns that are not needed for the data exploration, we drop those columns, namely:

  • ID
  • Timestamp
  • Group
  • Collector
  • Category
  • Topic
  • Keywords
  • Reviewer
  • Review
  • Screenshot
  • Views
  • Rating
  • Reasoning
  • Remarks

These columns exist to help the researchers distinguish the samples at a meta level and are not needed for the analysis.

In [8]:
tweets.columns
Out[8]:
Index(['un', 'Timestamp', 'Tweet URL', 'Group', 'Collector', 'Category',
       'Topic', 'Keywords', 'Account handle', 'Account name', 'Account bio',
       'Account type', 'Joined', 'Following', 'Followers', 'Location', 'Tweet',
       'Tweet Translated', 'Tweet Type', 'Date posted', 'Screenshot',
       'Content type', 'Likes', 'Replies', 'Retweets', 'Quote Tweets', 'Views',
       'Rating', 'Reasoning', 'Remarks', 'Marcos supporter',
       'Duterte supporter', 'Explanation for the political stance',
       'Red-tagging', 'Tone', 'Reviewer', 'Review', 'ID'],
      dtype='object')
In [9]:
tweets = tweets.drop(columns=[
                                'ID',
                                'Timestamp',
                                'Group',
                                'Collector',
                                'Category',
                                'Topic',
                                'Keywords',
                                'Reviewer',
                                'Review',
                                'Screenshot',
                                'Views',
                                'Rating',
                                'Reasoning',
                                'Remarks'
                             ]
                    )

Modifying column names¶

For ease of coding, we rename the columns to snake_case.

In [10]:
column_names = tweets.columns.tolist()
column_names
Out[10]:
['un',
 'Tweet URL',
 'Account handle',
 'Account name',
 'Account bio',
 'Account type',
 'Joined',
 'Following',
 'Followers',
 'Location',
 'Tweet',
 'Tweet Translated',
 'Tweet Type',
 'Date posted',
 'Content type',
 'Likes',
 'Replies',
 'Retweets',
 'Quote Tweets',
 'Marcos supporter',
 'Duterte supporter',
 'Explanation for the political stance',
 'Red-tagging',
 'Tone']
In [11]:
def snake_caseify_column(name):
    if name == 'Explanation for the political stance':
        return 'stance_explanation'
    elif name == 'Red-tagging':
        return 'red_tagging'

    return '_'.join(name.lower().split())

new_col_names = list(map(snake_caseify_column, column_names))
new_col_names
Out[11]:
['un',
 'tweet_url',
 'account_handle',
 'account_name',
 'account_bio',
 'account_type',
 'joined',
 'following',
 'followers',
 'location',
 'tweet',
 'tweet_translated',
 'tweet_type',
 'date_posted',
 'content_type',
 'likes',
 'replies',
 'retweets',
 'quote_tweets',
 'marcos_supporter',
 'duterte_supporter',
 'stance_explanation',
 'red_tagging',
 'tone']
In [12]:
tweets.columns = new_col_names
In [13]:
tweets.head(5)
Out[13]:
un tweet_url account_handle account_name account_bio account_type joined following followers location ... content_type likes replies retweets quote_tweets marcos_supporter duterte_supporter stance_explanation red_tagging tone
0 17-1 https://twitter.com/DelicadoJuanito/status/130... @DelicadoJuanito Juanito Delicado KKK Philippines member last generation Miriam ... Anonymous 2020-08-01 38.0 3 NaN ... Emotional 0.0 0.0 0.0 0.0 False True Unknown stance on Marcos\n\nTweet about Dutert... True Negative
1 17-2 https://twitter.com/MDSOnwardPH22/status/12975... @MDSOnwardPH22 ❤💞💞🇵🇭🇵🇭🙏🙏MARCOS-DUTERTE👊🏽👊🏽👊🏽 When helping the poor,leave the camera at home! Anonymous 2018-05-01 7982.0 12431 Manila ... Emotional 37.0 2.0 14.0 4.0 True True Account name True Negative
2 17-3 https://twitter.com/ysaysantos24/status/127895... @ysaysantos24 Relissa Lucena YaKaP ng Magulang\nAj is my life, maging makat... Identified 2020-01-01 183.0 953 NaN ... Emotional 84.0 4.0 28.0 3.0 False True Unknown stance on Marcos \n\nLikes tweets supp... True Negative
3 17-4 https://twitter.com/Dbigbalbowski/status/97052... @Dbigbalbowski Mr James🇵🇭 NaN Anonymous 2011-05-01 1676.0 8352 New York, USA ... Emotional 0.0 0.0 0.0 0.0 True True Liked tweets supporting Duterte and Marcos. True Negative
4 17-5 https://twitter.com/Earth751/status/9700304405... @Earth751 Earth@75 ako y simpling tao maka diyos.makatao at makab... Anonymous 2017-03-01 9.0 2 KSA KHOBAR ... Emotional 0.0 0.0 0.0 0.0 False True Tweets and likes content in support of or rela... True Negative

5 rows × 24 columns

Checking for missing data¶

Below is a summary of the dataframe information. It shows us at a glance the number of non-null entries in the dataframe. Because our combined dataset has 276 samples, any column whose non-null count is less than 276 has holes in our data. For our data pre-processing, then, we investigate these holes.

In [14]:
tweets.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 276 entries, 0 to 275
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   un                  150 non-null    object        
 1   tweet_url           276 non-null    object        
 2   account_handle      276 non-null    object        
 3   account_name        275 non-null    object        
 4   account_bio         233 non-null    object        
 5   account_type        276 non-null    object        
 6   joined              276 non-null    datetime64[ns]
 7   following           150 non-null    float64       
 8   followers           276 non-null    int64         
 9   location            159 non-null    object        
 10  tweet               276 non-null    object        
 11  tweet_translated    195 non-null    object        
 12  tweet_type          276 non-null    object        
 13  date_posted         276 non-null    object        
 14  content_type        275 non-null    object        
 15  likes               275 non-null    float64       
 16  replies             275 non-null    float64       
 17  retweets            275 non-null    float64       
 18  quote_tweets        275 non-null    float64       
 19  marcos_supporter    276 non-null    bool          
 20  duterte_supporter   276 non-null    bool          
 21  stance_explanation  275 non-null    object        
 22  red_tagging         276 non-null    bool          
 23  tone                275 non-null    object        
dtypes: bool(3), datetime64[ns](1), float64(5), int64(1), object(14)
memory usage: 46.2+ KB
In [15]:
# This summarizes the columns that do have null values.
for i in tweets.columns[tweets.isna().any()].tolist():
  print(i)
un
account_name
account_bio
following
location
tweet_translated
content_type
likes
replies
retweets
quote_tweets
stance_explanation
tone

Filling in missing values¶

However, while we should fill in holes where we can, not all values are required or even available. For example, a Twitter account is not required to have a Location or an Account bio, which is why those columns have more null values than the others. The Views column (dropped earlier) was also largely null because the Views feature of Tweets only started rolling out in late December 2022, which covers only a small part of the date range of our data. Thus, for our data clean-up, we only pay attention to null values that represent an actual lack of information needed for our research. These relevant columns are:

  • Account name
  • Content type
  • Likes
  • Replies
  • Retweets
  • Quote Tweets
  • Explanation for the political stance

For fields describing the tweet or the Twitter account, we replace the NaN values with the information currently shown on Twitter. For fields that require our own assessment, we fill them in with our own judgment.

Account Name¶

In [16]:
tweets[tweets['account_name'].isna()]
Out[16]:
un tweet_url account_handle account_name account_bio account_type joined following followers location ... content_type likes replies retweets quote_tweets marcos_supporter duterte_supporter stance_explanation red_tagging tone
37 17-38 https://twitter.com/crux_sonata/status/1270296... @crux_sonata NaN mahalin natin aNg pilipinas Anonymous 2016-11-01 272.0 1416 Philippines ... Emotional 1.0 0.0 0.0 0.0 False True Has posts supporting Duterte True Negative

1 rows × 24 columns

In [17]:
tweets.at[37,'account_name']="Crux of the Matter 🕊🏃‍♀️🦅🏃‍♀️"

Content Type¶

In [18]:
tweets[tweets['content_type'].isna()]
Out[18]:
un tweet_url account_handle account_name account_bio account_type joined following followers location ... content_type likes replies retweets quote_tweets marcos_supporter duterte_supporter stance_explanation red_tagging tone
27 17-28 https://twitter.com/dyslexiczs/status/94222907... @dyslexiczs ᜉᜐᜅ͓ ᜄᜎ I am dyslexic and my opinion matters\n Anonymous 2016-10-01 279.0 79 NaN ... NaN NaN NaN NaN NaN False True Supports Duterte [1]\n\nDoes not support Marco... True Negative

1 rows × 24 columns

In [19]:
tweets.at[27, 'content_type'] = "Emotional"

Likes, Replies, Retweets, and Quote Tweets¶

In [20]:
tweets[tweets['likes'].isna()]
Out[20]:
un tweet_url account_handle account_name account_bio account_type joined following followers location ... content_type likes replies retweets quote_tweets marcos_supporter duterte_supporter stance_explanation red_tagging tone
27 17-28 https://twitter.com/dyslexiczs/status/94222907... @dyslexiczs ᜉᜐᜅ͓ ᜄᜎ I am dyslexic and my opinion matters\n Anonymous 2016-10-01 279.0 79 NaN ... Emotional NaN NaN NaN NaN False True Supports Duterte [1]\n\nDoes not support Marco... True Negative

1 rows × 24 columns

In [21]:
tweets[tweets['replies'].isna()]
Out[21]:
un tweet_url account_handle account_name account_bio account_type joined following followers location ... content_type likes replies retweets quote_tweets marcos_supporter duterte_supporter stance_explanation red_tagging tone
27 17-28 https://twitter.com/dyslexiczs/status/94222907... @dyslexiczs ᜉᜐᜅ͓ ᜄᜎ I am dyslexic and my opinion matters\n Anonymous 2016-10-01 279.0 79 NaN ... Emotional NaN NaN NaN NaN False True Supports Duterte [1]\n\nDoes not support Marco... True Negative

1 rows × 24 columns

In [22]:
tweets[tweets['retweets'].isna()]
Out[22]:
un tweet_url account_handle account_name account_bio account_type joined following followers location ... content_type likes replies retweets quote_tweets marcos_supporter duterte_supporter stance_explanation red_tagging tone
27 17-28 https://twitter.com/dyslexiczs/status/94222907... @dyslexiczs ᜉᜐᜅ͓ ᜄᜎ I am dyslexic and my opinion matters\n Anonymous 2016-10-01 279.0 79 NaN ... Emotional NaN NaN NaN NaN False True Supports Duterte [1]\n\nDoes not support Marco... True Negative

1 rows × 24 columns

In [23]:
tweets[tweets['quote_tweets'].isna()]
Out[23]:
un tweet_url account_handle account_name account_bio account_type joined following followers location ... content_type likes replies retweets quote_tweets marcos_supporter duterte_supporter stance_explanation red_tagging tone
27 17-28 https://twitter.com/dyslexiczs/status/94222907... @dyslexiczs ᜉᜐᜅ͓ ᜄᜎ I am dyslexic and my opinion matters\n Anonymous 2016-10-01 279.0 79 NaN ... Emotional NaN NaN NaN NaN False True Supports Duterte [1]\n\nDoes not support Marco... True Negative

1 rows × 24 columns

In [24]:
# The sample with NaN value (tweet #27) is the same across these four characteristics:
# All these values are 0 for this tweet

tweets.at[27, 'likes'] = 0
tweets.at[27, 'replies'] = 0
tweets.at[27, 'retweets'] = 0
tweets.at[27, 'quote_tweets'] = 0

Political Stance¶

In [25]:
tweets[tweets['stance_explanation'].isna()]
Out[25]:
un tweet_url account_handle account_name account_bio account_type joined following followers location ... content_type likes replies retweets quote_tweets marcos_supporter duterte_supporter stance_explanation red_tagging tone
25 17-26 https://twitter.com/merilla2010/status/9398245... @merilla2010 Marski 👊👊👊 Simple Man Anonymous 2010-08-01 612.0 1035 Saudi Arabia ... Emotional 4.0 1.0 2.0 0.0 False False NaN True Negative

1 rows × 24 columns

Since the missing explanation means the researchers overlooked assessing this tweeter's political stance, we fill in the values here.

In [26]:
tweets.at[25, 'marcos_supporter'] = False
tweets.at[25, 'duterte_supporter'] = True
tweets.at[25, 'stance_explanation'] = "Display name has fist emojis commonly associated with Duterte. Not enough data to show support for Marcos."

Ensuring consistent data formatting¶

Date posted column¶

During the process of data visualization in a subsequent section, we found that not all of the values in the Date posted column were parsed by pandas as datetime objects: some were in the DD/MM/YY HH:MM format instead of YYYY-MM-DD HH:MM:SS, and so were read as str objects instead.

In [27]:
from datetime import datetime

string_dates = tweets[tweets['date_posted'].apply(lambda x: isinstance(x, str))]
datetime_dates = tweets[tweets['date_posted'].apply(lambda x: isinstance(x, datetime))]

print(f"Dates in str format: {string_dates.shape[0]}")
print(f"Dates in datetime format: {datetime_dates.shape[0]}")
Dates in str format: 30
Dates in datetime format: 246
In [28]:
string_dates['date_posted'].head(5)
Out[28]:
44    14/05/22 10:31
46    15/04/21 08:51
47    27/01/21 15:34
48    29/10/20 10:45
50    24/02/22 10:56
Name: date_posted, dtype: object
In [29]:
datetime_dates['date_posted'].head(5)
Out[29]:
0    2020-08-30 19:30:00
1    2020-08-23 20:12:00
2    2020-07-03 15:26:00
3    2018-03-05 13:16:40
4    2018-03-04 04:17:33
Name: date_posted, dtype: object

To fix this, we replace the original Date posted column with a version in which each DD/MM/YY HH:MM string is converted into a datetime object.

In [30]:
def get_date_slice(date):
  # 'DD/MM/YY' -> [DD, MM, YY] as integers
  return [int(x) for x in date.split('/')]

def get_time_slice(time):
  # 'HH:MM' -> [HH, MM] as integers
  return [int(x) for x in time.split(':')]

def get_datetime_from_str(date_str):
  # Values that are already datetime objects pass through unchanged
  if isinstance(date_str, datetime):
    return date_str

  date_part, time_part = date_str.split(' ')
  date = get_date_slice(date_part)
  time = get_time_slice(time_part)

  # All two-digit years in the data fall in the 2000s
  return datetime(2000 + date[2], date[1], date[0], time[0], time[1])

tweets_test = tweets['date_posted'].map(get_datetime_from_str)
tweets['date_posted'] = tweets_test
tweets['date_posted']
Out[30]:
0     2020-08-30 19:30:00
1     2020-08-23 20:12:00
2     2020-07-03 15:26:00
3     2018-03-05 13:16:40
4     2018-03-04 04:17:33
              ...        
271   2020-03-28 17:23:17
272   2020-08-24 22:05:42
273   2021-08-28 19:54:31
274   2021-12-25 10:50:33
275   2022-12-29 18:45:12
Name: date_posted, Length: 276, dtype: datetime64[ns]

As a result of our processing, all of the values in the Date posted column are now datetime objects.

In [31]:
tweets['date_posted'].apply(lambda x: isinstance(x, datetime)).describe()
Out[31]:
count      276
unique       1
top       True
freq       276
Name: date_posted, dtype: object
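As an aside, pandas can perform this conversion directly; a minimal sketch of the equivalent call, assuming the same DD/MM/YY HH:MM strings as above (values that are already datetime objects generally pass through unchanged, though mixed-type columns may need format="mixed" on pandas 2.0+):

```python
import pandas as pd

# The same DD/MM/YY HH:MM strings seen in the unparsed rows above.
raw = pd.Series(["14/05/22 10:31", "27/01/21 15:34"])

# dayfirst=True makes pandas read 14/05/22 as 14 May 2022.
parsed = pd.to_datetime(raw, dayfirst=True)
print(parsed[0])  # 2022-05-14 10:31:00
```

The hand-rolled parser above keeps the conversion rules explicit, which is useful for documenting the assumption that all two-digit years fall in the 2000s.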

Removing out-of-scope values¶

Since the Date Posted column values were all properly converted to datetime objects, we can now remove tweets posted before 2020 from the dataset. Since rows were removed, we also reset the indices of the dataframe in order to avoid confusion.

In [32]:
tweets = tweets[~tweets.date_posted.apply(lambda x: x.year < 2020)]
tweets.reset_index(drop=True, inplace=True)

tweets
Out[32]:
un tweet_url account_handle account_name account_bio account_type joined following followers location ... content_type likes replies retweets quote_tweets marcos_supporter duterte_supporter stance_explanation red_tagging tone
0 17-1 https://twitter.com/DelicadoJuanito/status/130... @DelicadoJuanito Juanito Delicado KKK Philippines member last generation Miriam ... Anonymous 2020-08-01 00:00:00 38.0 3 NaN ... Emotional 0.0 0.0 0.0 0.0 False True Unknown stance on Marcos\n\nTweet about Dutert... True Negative
1 17-2 https://twitter.com/MDSOnwardPH22/status/12975... @MDSOnwardPH22 ❤💞💞🇵🇭🇵🇭🙏🙏MARCOS-DUTERTE👊🏽👊🏽👊🏽 When helping the poor,leave the camera at home! Anonymous 2018-05-01 00:00:00 7982.0 12431 Manila ... Emotional 37.0 2.0 14.0 4.0 True True Account name True Negative
2 17-3 https://twitter.com/ysaysantos24/status/127895... @ysaysantos24 Relissa Lucena YaKaP ng Magulang\nAj is my life, maging makat... Identified 2020-01-01 00:00:00 183.0 953 NaN ... Emotional 84.0 4.0 28.0 3.0 False True Unknown stance on Marcos \n\nLikes tweets supp... True Negative
3 17-12 https://twitter.com/BasarteDiaz/status/1294961... @BasarteDiaz Senior Herudes Electrical Engineer\nProud DDS\nDuterte Delici... Identified 2020-03-01 00:00:00 195.0 45 NaN ... Emotional 0.0 0.0 0.0 0.0 False True From the bio True Negative
4 17-15 https://twitter.com/RightWingPinoy/status/1287... @RightWingPinoy A Machiavelli fascist • right wing • business Anonymous 2019-03-01 00:00:00 180.0 68 Calamba City, Calabarzon ... Rational 0.0 0.0 0.0 0.0 True True Retweeted a thread promoting Marcos and Duterte True Negative
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
247 NaN https://twitter.com/icekemberloo/status/124383... icekemberloo yuki 🐕🐶🍓🥝🧀🍖🍗🥓🍱🍙🍳🥚🥙🌯🍜🍩🍫🍬 Anonymous 2020-03-25 23:21:15 NaN 13 National Capital Region, Repub ... Rational, Emotional 1.0 0.0 0.0 0.0 True True Has multiple tweets supporting Duterte-Marcos False Negative
248 NaN https://twitter.com/PilipinasScm/status/129789... PilipinasScm Student Christian Movement of the PH #SCMPat62 Est 27 Dec 1960. Follow Christ, (Rom 3:22) Lov... Media 2019-04-08 19:22:59 NaN 4361 NaN ... Rational 0.0 0.0 0.0 0.0 False False Media account False Neutral
249 NaN https://twitter.com/anakbayanMBHS/status/14315... anakbayanMBHS Anakbayan Metro-Baguio High School Isang komprehensibong pambansa demokratikong p... Media 2020-04-27 21:29:57 NaN 348 Baguio City ... Rational, Emotional 10.0 1.0 5.0 0.0 False False Anakbayan account False Neutral
250 NaN https://twitter.com/anakbayan_ne/status/147457... anakbayan_ne Anakbayan Nueva Ecija Lumalakas, lumalawak, lumalaban! #SumaliSaAnak... Media 2020-06-14 22:11:58 NaN 889 Nueva Ecija ... Emotional 5.0 0.0 8.0 0.0 False False Anakbayan account False Neutral
251 NaN https://twitter.com/BoniArtKolektib/status/160... BoniArtKolektib Bonifacio Artists Collective Bonifacio Artists Collective is a cultural org... Media 2021-11-18 17:26:44 NaN 119 Taguig City ... Rational 0.0 0.0 0.0 0.0 False False https://twitter.com/BoniArtKolektib/status/157... False Neutral

252 rows × 24 columns

Categorical data encoding¶

The only columns that require encoding are Marcos supporter and Duterte supporter.

In [33]:
tweets['marcos_supporter'] = tweets['marcos_supporter'].replace({True: 1, False: 0})
tweets['marcos_supporter']
Out[33]:
0      0
1      1
2      0
3      0
4      1
      ..
247    1
248    0
249    0
250    0
251    0
Name: marcos_supporter, Length: 252, dtype: int64
In [34]:
tweets['duterte_supporter'] = tweets['duterte_supporter'].replace({True: 1, False: 0})
tweets['duterte_supporter']
Out[34]:
0      1
1      1
2      1
3      1
4      1
      ..
247    1
248    0
249    0
250    0
251    0
Name: duterte_supporter, Length: 252, dtype: int64
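As a side note, since the original columns are plain booleans, astype(int) would perform the same encoding in one step; a minimal sketch on a toy column:

```python
import pandas as pd

# Toy boolean column standing in for marcos_supporter / duterte_supporter.
flags = pd.Series([True, True, False])

# astype(int) maps True -> 1 and False -> 0,
# equivalent to replace({True: 1, False: 0}).
encoded = flags.astype(int)
print(encoded.tolist())  # [1, 1, 0]
```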

Preparing for the $\chi^2$ test¶

One of the assumptions of the $\chi^2$ test is that each observation falls into exactly one class. Our dataset currently violates this: we aim to test the columns marcos_supporter and duterte_supporter, but a tweet's author can have both checked (a supporter of both) or neither checked (a supporter of neither).

Thus, we create four new columns following the one-hot encoding scheme:

  1. supports_marcos_only
  2. supports_duterte_only
  3. supports_both
  4. supports_neither
In [35]:
tweets['supports_marcos_only'] = tweets['marcos_supporter'] & ~tweets['duterte_supporter']
tweets['supports_duterte_only'] = ~tweets['marcos_supporter'] & tweets['duterte_supporter']
tweets['supports_both'] = tweets['marcos_supporter'] & tweets['duterte_supporter']
# A chained inplace replace on a column can fail to propagate in newer pandas,
# so we assign the encoded result back directly.
tweets['supports_neither'] = ((tweets['marcos_supporter'] + tweets['duterte_supporter']) == 0).astype(int)

# For data visualization:

def pol_stance(x):
    if x['marcos_supporter'] & ~x['duterte_supporter']:
        return "Marcos only"
    elif ~x['marcos_supporter'] & x['duterte_supporter']:
        return "Duterte only"
    elif x['marcos_supporter'] & x['duterte_supporter']:
        return "Marcos-Duterte"
    return 'Neither'


# pol_labels = ["Supports both", "Supports Duterte only", "Supports Marcos only", "Supports Neither"]

tweets['pol_stance'] = tweets.apply(pol_stance, axis=1)
In [36]:
tweets[['supports_marcos_only', 'supports_duterte_only', 'supports_both', 'supports_neither', 'pol_stance']]
Out[36]:
supports_marcos_only supports_duterte_only supports_both supports_neither pol_stance
0 0 1 0 0 Duterte only
1 0 0 1 0 Marcos-Duterte
2 0 1 0 0 Duterte only
3 0 1 0 0 Duterte only
4 0 0 1 0 Marcos-Duterte
... ... ... ... ... ...
247 0 0 1 0 Marcos-Duterte
248 0 0 0 1 Neither
249 0 0 0 1 Neither
250 0 0 0 1 Neither
251 0 0 0 1 Neither

252 rows × 5 columns

In [37]:
tweets[['supports_marcos_only', 'supports_duterte_only', 'supports_both', 'supports_neither']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   supports_marcos_only   252 non-null    int64
 1   supports_duterte_only  252 non-null    int64
 2   supports_both          252 non-null    int64
 3   supports_neither       252 non-null    int64
dtypes: int64(4)
memory usage: 8.0 KB
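The test itself is presumably conducted in a later notebook, but as a rough sketch of how counts like these feed into it: scipy.stats.chi2_contingency takes a contingency table of observed counts and returns the statistic and p-value. The counts below are made up for illustration, not taken from the dataset.

```python
from scipy.stats import chi2_contingency
import numpy as np

# Hypothetical contingency table: rows are stance groups, columns are
# red-tagging vs. non-red-tagging tweet counts (illustrative only).
observed = np.array([
    [40, 10],   # supports_both
    [20,  6],   # supports_duterte_only
    [ 5,  5],   # supports_marcos_only
    [50, 60],   # supports_neither
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}")  # dof = (4-1)*(2-1) = 3
```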

Ensuring consistent language across all tweets¶

In [38]:
tweets[tweets['tweet_translated'].isna()].shape[0]
Out[38]:
76

Before conducting natural language processing on the dataset, we must ensure that the tweets used are all in English. However, since not all tweets had to be translated, the dataset's tweet_translated column has some empty values.

In order to preserve the original values in the dataset, we create a new column final_tweet to collect the tweet contents that will be used in the data processing.

In [39]:
tweets['final_tweet'] = tweets['tweet_translated'].fillna(tweets['tweet'])
tweets['final_tweet']
Out[39]:
0      You killed them, not the government. Don't you...
1      Poor youth. They are wasting their bright futu...
2      Why do we need to protect our children against...
3      They are the group that knows nothing but free...
4      That's nice. LFS, Anakbayan, Kabataan, Gabriel...
                             ...                        
247    @CarmiLu68 so true! .. most of them are millen...
248    @SCMP_Tuguegarao @SCMPDavao @scmpmetrobaguio @...
249    We are the Anakbayan Metro-Baguio Highschool, ...
250    Along with our celebration of Christmas Day, w...
251    Temporary Anakbayan FB page: https://t.co/lW80...
Name: final_tweet, Length: 252, dtype: object
In [40]:
tweets[tweets['final_tweet'].isna()]
Out[40]:
un tweet_url account_handle account_name account_bio account_type joined following followers location ... duterte_supporter stance_explanation red_tagging tone supports_marcos_only supports_duterte_only supports_both supports_neither pol_stance final_tweet

0 rows × 30 columns

Exploring the numbers¶

With our dataset cleaned up, we can look at the distribution of values in the dataset.

In [41]:
tweets.describe().style
Out[41]:
  following followers likes replies retweets quote_tweets marcos_supporter duterte_supporter supports_marcos_only supports_duterte_only supports_both supports_neither
count 126.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000
mean 725.126984 2711.575397 16.702381 1.067460 6.158730 0.670635 0.440476 0.503968 0.039683 0.103175 0.400794 0.456349
std 1329.188107 6638.969192 54.263511 4.209576 20.321093 3.805879 0.497432 0.500979 0.195601 0.304792 0.491035 0.499082
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 130.000000 168.500000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 302.000000 564.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
75% 745.500000 1420.750000 8.000000 1.000000 2.000000 0.000000 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000
max 9381.000000 29298.000000 531.000000 46.000000 203.000000 41.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
In [42]:
# Check tone of tweets
print("Number of positive tweets: ", tweets[tweets['tone'] == 'Positive']['tone'].count())
print("Number of negative tweets: ", tweets[tweets['tone'] == 'Negative']['tone'].count())
print("Number of neutral tweets: ", tweets[tweets['tone'] == 'Neutral']['tone'].count())
Number of positive tweets:  15
Number of negative tweets:  144
Number of neutral tweets:  92
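The three separate count expressions above can be collapsed into a single `value_counts()` call; a minimal sketch on a toy dataframe (the tone labels below are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the tweets dataframe (hypothetical tone labels)
toy = pd.DataFrame({'tone': ['Negative', 'Neutral', 'Negative', 'Positive', 'Negative']})

# value_counts() returns the frequency of every tone in one call
counts = toy['tone'].value_counts()
print(counts)
```

This avoids scanning the column once per category.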

Visualizing the data¶

We then visualize a general overview of the tweets in our dataset based on our classification of whether or not the posters are Marcos or Duterte supporters.

Distribution of political stance¶

In [43]:
marcos = tweets.query("marcos_supporter == 1").shape[0]
duterte = tweets.query("duterte_supporter == 1").shape[0]

marcos_duterte = tweets.query("marcos_supporter == 1 and duterte_supporter == 1").shape[0]
marcos_only = tweets.query("marcos_supporter == 1 and duterte_supporter == 0").shape[0]
duterte_only = tweets.query("marcos_supporter == 0 and duterte_supporter == 1").shape[0]
neither = tweets.query("marcos_supporter == 0 and duterte_supporter == 0").shape[0]

total = tweets.shape[0]
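The `.query()` strings above are equivalent to boolean-mask counting; a minimal sketch on a toy dataframe with hypothetical supporter flags:

```python
import pandas as pd

# Toy dataframe with hypothetical one-hot supporter flags
toy = pd.DataFrame({
    'marcos_supporter':  [1, 1, 0, 0, 1],
    'duterte_supporter': [1, 0, 1, 0, 1],
})

# Boolean masks combined with & mirror the "and" in the query strings;
# summing a boolean mask counts the rows where it is True
both         = ((toy['marcos_supporter'] == 1) & (toy['duterte_supporter'] == 1)).sum()
marcos_only  = ((toy['marcos_supporter'] == 1) & (toy['duterte_supporter'] == 0)).sum()
duterte_only = ((toy['marcos_supporter'] == 0) & (toy['duterte_supporter'] == 1)).sum()
neither      = ((toy['marcos_supporter'] == 0) & (toy['duterte_supporter'] == 0)).sum()
print(both, marcos_only, duterte_only, neither)
```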
In [44]:
pie_data = np.array([marcos_duterte, marcos_only, duterte_only, neither])
pie_labels = [
    "Marcos-Duterte",
    "Marcos only",
    "Duterte only",
    "Neither"
]

interactive_pie = go.Pie(labels=pie_labels, values=pie_data)
fig = go.Figure(data=interactive_pie)
fig.update_layout(
    title_text='Poster Political Leaning', # title of plot
    width=625
)
fig.show(width=625)
# chart_studio.plotly.iplot(fig, filename = 'political-leaning', auto_open=True)

Distribution of content type¶

In [45]:
rational = tweets.query("content_type == \"Rational\"").shape[0]
emotional = tweets.query("content_type == \"Emotional\"").shape[0]
transactional = tweets.query("content_type == \"Transactional\"").shape[0]

content_type_data = np.array([rational, emotional, transactional])
content_type_labels = [
    "Rational",
    "Emotional",
    "Transactional",
]

acct_type_counts = pd.DataFrame({
    'Content Type': content_type_labels,
    'No. of tweets': content_type_data
})

fig = px.bar(acct_type_counts, x="Content Type", y="No. of tweets", title="Content Type of collected tweets")
fig.update_layout(
    width=625
)
fig.show()

It can be seen that the majority of the tweets collected were Emotional, and that many of them were also replies to other tweets.

In [46]:
emotional_tweets = tweets.query("content_type == 'Emotional'")

emotional_tweets[['tweet', 'content_type', 'tweet_type']]

reply_count = emotional_tweets[emotional_tweets['tweet_type'].str.contains('Reply')].shape[0]

print(f"Number of Emotional tweets that are also replies: {reply_count}")
emotional_tweets[['tweet', 'content_type', 'tweet_type']]
Number of Emotional tweets that are also replies: 105
Out[46]:
tweet content_type tweet_type
0 Kayo po pumatay d ang government. Huwag nyo k... Emotional Text, Reply
1 Kawawang kabataan,sinayang ang magandang kinab... Emotional Text, Video
2 Bakit namin kailangan protektahan ang aming mg... Emotional Text, Image
3 Sila yung grupo na walang alam kundi freedom o... Emotional Text, Image
5 May mga woke na taga US din. Nagco-comment ka... Emotional Text, Reply
... ... ... ...
240 Paano natin nasasabing "pasista" ang isang "pa... Emotional Text
241 hapee birthday anakbayan ne ako toh si anna ta... Emotional Text
245 Sumali sa Anakbayan PUP-COC: https://t.co/ezVZ... Emotional Text, Reply, Link
246 Sabay-sabay nating palawakin ang ating kaalama... Emotional Text, Reply
250 Kasabay ng ating pagsasalo-salo ngayong Araw n... Emotional Text, Image

171 rows × 3 columns
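The `str.contains('Reply')` filter works because `tweet_type` stores comma-separated labels; a minimal sketch with hypothetical label values:

```python
import pandas as pd

# Toy tweet_type column mirroring the comma-separated labels above (values hypothetical)
toy = pd.DataFrame({'tweet_type': ['Text, Reply', 'Text, Image', 'Text, Reply, Link', 'Text']})

# str.contains() matches the substring anywhere inside the label list
replies = toy[toy['tweet_type'].str.contains('Reply')]
print(len(replies))  # 2
```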

Distribution of date posted¶

In [47]:
def all_quarters():
  ret = []
  years = [str(x) for x in range(2020, 2023)]
  quarters = ["Q1", "Q2", "Q3", "Q4"]
  for year in years:
    for qtr in quarters:
      ret.append(year + qtr)
  return ret

quarter_posted = pd.PeriodIndex(tweets["date_posted"], freq='Q')
tweets['quarter_posted'] = quarter_posted

quarter_counts = list(map(lambda qtr: (tweets['quarter_posted']==qtr).sum(), all_quarters()))
quarter_counts
Out[47]:
[10, 54, 18, 11, 15, 13, 12, 17, 33, 53, 5, 11]
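The quarter bucketing above relies on `pd.PeriodIndex` with `freq='Q'`, which maps each timestamp to its calendar quarter; a minimal sketch on hypothetical post dates:

```python
import pandas as pd

# Hypothetical post dates
dates = pd.Series(pd.to_datetime(['2020-04-15', '2020-06-01', '2022-05-30', '2021-01-02']))

# Each date collapses to its quarter, e.g. 2020-04-15 -> 2020Q2
quarters = pd.Series(pd.PeriodIndex(dates, freq='Q'))

# Counting matches for one quarter, as the cell above does per label
count = (quarters == pd.Period('2020Q2', freq='Q')).sum()
print(count)  # 2
```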
In [55]:
# Heatmap of when the collected tweets were posted

data=np.array([[quarter_counts[x] for x in range(i*4,i*4 + 4)] for i in range(0, 3)])
data = data.T

fig = px.imshow(data,
                labels=dict(x="Year", y="Quarter", color="Tweets"),
                x=[str(x) for x in range(2020, 2023)],
                y=['Q1', 'Q2', 'Q3', 'Q4'],
                title="Distribution of 'Date posted' for tweets by quarter"
               )
fig.update_layout(
    width=625
)
fig.show()
# chart_studio.plotly.iplot(fig, filename = 'date-posted', auto_open=True)

By visualizing the distribution of post dates, we hoped to gain additional insight into the context behind increases or decreases in the number of red-tagging tweets posted during certain periods.

Of the 252 tweets collected, the greatest numbers of red-tagging tweets were found in the following quarters: Q2 2020, Q2 2022, and Q1 2022.

Though the scope is limited to the 252 tweets collected by the researchers, and could have been affected by biases introduced by Twitter's search algorithm, the surges coincide with two major events, the COVID-19 lockdown and the 2022 Presidential Elections:

  • 2nd Quarter of 2020: the first few months of the COVID-19 pandemic lockdown
  • 1st Quarter of 2022: the campaign period for the 2022 Presidential Elections
  • 2nd Quarter of 2022: the 2022 Presidential Elections

Red-tagging Political Distributions¶

In [56]:
# creating a copy of tweets df to sort

pol_df = tweets[['red_tagging', 'pol_stance', 'tone']]
# pol_df.sort_values(['pol_stance'],ascending=[True],inplace=True)

fig = px.histogram(pol_df, x="red_tagging",
             color='pol_stance', barmode='group',
             histfunc='count', color_discrete_map={
                'Marcos-Duterte' : '#ef543a',
                'Marcos only' : '#ab63fa',
                'Duterte only' : '#00cd97',
                'Neither' : '#636ffb'
            })
fig.update_layout(
    title_text='Distribution of political stances of red-tagging vs non-red-tagging', # title of plot
    xaxis_title_text='Is the tweet red-tagging?', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1, # gap between bars of the same location coordinates
    legend_title="Political Stance",
    width=625
)

fig.show()
# chart_studio.plotly.iplot(fig, filename = 'red-vs-nonred', auto_open=True)

Comparing the presence of red-tagging in a tweet with the political stance of the user who posted it, an overwhelming majority of the red-tagging tweets came from Marcos-Duterte, Duterte-only, and Marcos-only supporters: 125 out of the 131 red-tagging tweets came from those groups.

The chart also shows that a majority of the non-red-tagging tweets came from those who support neither, with 109 of the 121 non-red-tagging tweets coming from that group.

In [57]:
fig = px.histogram(pol_df, x="tone",
             color='pol_stance', barmode='group',
             histfunc='count',
             color_discrete_map={
                'Marcos-Duterte' : '#ef543a',
                'Marcos only' : '#ab63fa',
                'Duterte only' : '#00cd97',
                'Neither' : '#636ffb'
            })

fig.update_layout(
    title_text='Distribution of political stances of different tones of tweets', # title of plot
    xaxis_title_text='Tone', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1, # gap between bars of the same location coordinates
    legend_title="Political Stance",
    width=625

)
fig.show()
# chart_studio.plotly.iplot(fig, filename = 'tweet-tones', auto_open=True)

The tones of the tweets also closely tracked political stance: no supporter of Marcos, Duterte, or both had anything positive to say about the organization Anakbayan. Only one Marcos-Duterte supporter tweeted with a neutral sentiment; apart from that single tweet, every tweet from Marcos-only, Duterte-only, and Marcos-Duterte supporters had a negative tone.

While around 8 of those who support neither also expressed negative views toward Anakbayan, the rest of the tweets from that group carried either neutral or positive sentiments.

Data Modeling¶

Statistical Model¶

Chi-square test¶

For this test, we utilized the chi-square test for independence with the Bonferroni correction applied (Wijaya, 2020). This is because we are testing the independence of the presence of red-tagging from four different one-hot encoded features: supports_marcos_only, supports_duterte_only, supports_both, and supports_neither.

This means that for a significance level of 0.05 (95% confidence), since we are making comparisons across four columns, we divide 0.05 by 4 to arrive at 0.0125 as our adjusted significance level (α).
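As a concrete sketch of the procedure, here is the corrected threshold together with a single chi-square test on a hypothetical 2×2 contingency table (the counts are invented for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Bonferroni: divide the family-wise alpha by the number of comparisons
alpha, n_tests = 0.05, 4
bon_alpha = alpha / n_tests  # 0.0125

# Hypothetical 2x2 table: rows = red_tagging (no/yes), cols = supporter flag (0/1)
table = np.array([[40, 10],
                  [ 5, 45]])
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)

# Reject independence only if p clears the corrected threshold
print(bon_alpha, p_value < bon_alpha)
```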

$H$: Fanatics/apologists of Duterte are more likely to red-tag Anakbayan as a terrorist organization.

$H_A$: Fanatics/apologists of Marcos are more likely to red-tag Anakbayan as a terrorist organization.

$H_0$: Both fanatics and non-fanatics are equally likely to participate in the red-tagging of Anakbayan.

In [51]:
from scipy.stats import chi2_contingency

dataset_columns = ['supports_duterte_only', 'supports_marcos_only', 'supports_both', 'supports_neither']

result_columns = []

for col in dataset_columns:
  df_col = tweets[col]
  bon_alpha = 0.05/len(dataset_columns)

  contingency_table = pd.crosstab(tweets['red_tagging'], df_col)
  chi2, p_value, dof, expected = chi2_contingency(contingency_table, correction=False)

  hypothesis_result = 'Reject null hypothesis' if p_value < bon_alpha else 'Fail to reject null hypothesis'

  result_values = [col, hypothesis_result, chi2, p_value, bon_alpha, dof, expected]
  result_columns.append(result_values)


res_chi_ph = pd.DataFrame(data = result_columns)
res_chi_ph.columns = [
    'Political Leaning',
    'Hypothesis',
    'chi^2 statistic',
    'p-value',
    'Bonferroni α',
    'Degrees of freedom',
    'Expected frequencies'
]
res_chi_ph
Out[51]:
Political Leaning Hypothesis chi^2 statistic p-value Bonferroni α Degrees of freedom Expected frequencies
0 supports_duterte_only Reject null hypothesis 22.659960 1.933553e-06 0.0125 1 [[108.51587301587301, 12.484126984126984], [11...
1 supports_marcos_only Fail to reject null hypothesis 3.274447 7.036666e-02 0.0125 1 [[116.1984126984127, 4.801587301587301], [125....
2 supports_both Reject null hypothesis 103.265098 2.931729e-24 0.0125 1 [[72.50396825396825, 48.49603174603175], [78.4...
3 supports_neither Reject null hypothesis 185.351602 3.288824e-42 0.0125 1 [[65.78174603174604, 55.21825396825397], [71.2...

The results above show that we can reject the null hypothesis for those who support Duterte only, those who support both Marcos and Duterte, and those who support neither: red-tagging is significantly associated with political stance in each of these groups. Duterte-only and Marcos-Duterte supporters are disproportionately likely to red-tag Anakbayan, while supporters of neither are disproportionately unlikely to do so.

As for those who support Marcos only, we were not able to obtain a statistically significant result, likely because the dataset contained only 10 samples from Marcos-only supporters.

Machine Learning Model¶

Choosing the number of topics¶

To choose the number of topics, we looked at the results of topic modeling with 3, 4, and 5 topics to see which would give us the most insight.

newplot (1).png

The figure above shows the result with 3 topics. With three topics, there is considerable overlap between keywords, and little can be said about each individual topic.

newplot.png

The figure above shows the result with 4 topics. While the topics are more distinguishable, the figure shows that the clusters are not compact; this is especially visible in topic 1, where some points lie closer to other topics than to their own.

Initiate topic clustering¶

In [58]:
%%capture
# Initialize NLP components
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

from textblob import TextBlob

!pip install pyspellchecker
from spellchecker import SpellChecker

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
spell = SpellChecker()
In [53]:
%%capture
!pip install emoji --upgrade
In [59]:
# Topic modeling via LDA
# Source: https://www.kaggle.com/code/infamouscoder/lda-topic-modeling-features
import re
import emoji

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import decomposition

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Custom tokenizer
def tokenizer(text):
  text = emoji.replace_emoji(text, replace='')    # remove emojis
  text = re.sub(r"http\S+", "", text)             # remove URLs
  text = re.sub(r"[^\w\s]", "", text)             # remove punctuation
  tokens = [word for word in word_tokenize(text) if len(word)>3]                           # keep only 4+-length words
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

  filtered_tokens = [token for token in lemmatized_tokens if token.lower() not in stop_words]
  # stemmed_tokens = [stemmer.stem(item) for item in filtered_tokens]
  return filtered_tokens

# Generate features
tf_vectorizer = TfidfVectorizer(tokenizer=tokenizer,
                                max_df=0.75, max_features=10000,
                                use_idf=True, norm=None, token_pattern=None)
tf_vectors = tf_vectorizer.fit_transform(tweets.final_tweet)

# Fit LDA with 5 topics
n_topics = 5
lda = decomposition.LatentDirichletAllocation(n_components=n_topics, max_iter=10,
                                              learning_method='online', learning_offset=50,
                                              n_jobs=1, random_state=42,
                                              topic_word_prior=0.000001)
W = lda.fit_transform(tf_vectors)
H = lda.components_

# Show top 8 relevant words for each of the 5 topics
num_words = 8
vocab = np.array(tf_vectorizer.get_feature_names_out())
top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_words-1:-1]]
topic_words = ([top_words(t) for t in H])
topics = [' '.join(t) for t in topic_words]
df_topics = pd.DataFrame(topics, columns=['Keywords'])
df_topics['Topic ID'] = range(1, len(topics) + 1)
df_topics
Out[59]:
Keywords Topic ID
0 anakbayan scmp_uplb member mass nusphilippines... 1
1 anakbayan terrorist leader people youth activi... 2
2 anakbayan communist front accord anakbayan_ph ... 3
3 youth anakbayan partylist know group front fil... 4
4 anakbayan kabataan bayan like muna front group... 5
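The `np.argsort(t)[:-num_words-1:-1]` slice used above to pick each topic's top keywords can be illustrated on a toy weight vector (vocabulary and weights are hypothetical):

```python
import numpy as np

# Toy topic-word weight vector over a tiny vocabulary
vocab = np.array(['anakbayan', 'youth', 'front', 'terrorist', 'group'])
weights = np.array([0.9, 0.1, 0.5, 0.7, 0.2])

# argsort is ascending; the reversed slice [:-k-1:-1] yields the k largest indices
k = 3
top = vocab[np.argsort(weights)[:-k-1:-1]]
print(top)  # ['anakbayan' 'terrorist' 'front']
```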

Visualize model¶

In [60]:
# Assign topic to each tweet
topicid = ["Topic" + str(i+1) for i in range(lda.n_components)]
tweetid = ["Tweet" + str(i+1) for i in range(len(tweets.final_tweet))]

df_topics_lda = pd.DataFrame(np.round(W,2), columns=topicid, index=tweetid)
significanttopic = np.argmax(df_topics_lda.values, axis=1)+1

df_topics_lda['dominant_topic'] = significanttopic
df_topics_lda['breakdown'] = df_topics_lda.apply(lambda row: '\n'.join([f'{col}: {row[col]}'
                                                        for col in sorted(df_topics_lda.columns, key=lambda x: row[x], reverse=True)
                                                        if row[col] > 0 and col != 'dominant_topic']), axis=1)
df_topics_lda.head(10)
Out[60]:
Topic1 Topic2 Topic3 Topic4 Topic5 dominant_topic breakdown
Tweet1 0.01 0.01 0.01 0.98 0.01 4 Topic4: 0.98\nTopic1: 0.01\nTopic2: 0.01\nTopi...
Tweet2 0.00 0.00 0.00 0.98 0.00 4 Topic4: 0.98
Tweet3 0.00 0.00 0.00 0.99 0.00 4 Topic4: 0.99
Tweet4 0.00 0.00 0.00 0.99 0.00 4 Topic4: 0.99
Tweet5 0.99 0.00 0.00 0.00 0.00 1 Topic1: 0.99
Tweet6 0.00 0.00 0.98 0.00 0.00 3 Topic3: 0.98
Tweet7 0.00 0.00 0.00 0.98 0.00 4 Topic4: 0.98
Tweet8 0.00 0.00 0.00 0.00 0.99 5 Topic5: 0.99
Tweet9 0.00 0.00 0.79 0.00 0.20 3 Topic3: 0.79\nTopic5: 0.2
Tweet10 0.00 0.00 0.00 0.00 0.99 5 Topic5: 0.99
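The dominant-topic assignment above is simply a row-wise `argmax` shifted to 1-based topic IDs; a minimal sketch on a hypothetical document-topic matrix:

```python
import numpy as np

# Toy document-topic matrix W (each row sums to ~1); values hypothetical
W = np.array([[0.01, 0.01, 0.98],
              [0.80, 0.10, 0.10],
              [0.05, 0.90, 0.05]])

# Dominant topic per tweet: index of the largest weight in each row, +1 for 1-based IDs
dominant = np.argmax(W, axis=1) + 1
print(dominant)  # [3 1 2]
```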
In [66]:
# Visualize topics
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# Apply t-SNE for dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
tsne_result = tsne.fit_transform(df_topics_lda.iloc[:,:n_topics])

# Apply K-means clustering
kmeans = KMeans(n_clusters=n_topics, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(df_topics_lda.iloc[:,:n_topics])
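A minimal sketch of the same t-SNE + K-means pipeline on a toy matrix, showing the shape contract of each step (the dimensions and data are hypothetical):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# Toy document-topic matrix: 10 documents x 5 topics (random but seeded)
rng = np.random.default_rng(42)
W = rng.random((10, 5))

# t-SNE projects the 5-D topic weights down to 2-D for plotting;
# perplexity must be smaller than the number of samples
coords = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(W)

# K-means offers an alternative cluster label per point (unused in the final plot)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(W)
print(coords.shape, labels.shape)
```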
In [62]:
# Create a new dataframe with t-SNE coordinates and cluster labels
import textwrap

def split_text(text, max_length):
  lines = textwrap.wrap(text, width=max_length, break_long_words=False)
  return "<br>".join(lines)

df_topics_cluster = pd.DataFrame({'X': tsne_result[:, 0],
                                  'Y': tsne_result[:, 1],
                                  'Tweet': tweets['final_tweet'],
                                  'Cluster': df_topics_lda.reset_index()['dominant_topic'].astype(str), # topics via LDA
                                  # 'Cluster': cluster_labels},                                         # clusters via K-means
                                  'Breakdown': df_topics_lda.reset_index()['breakdown']})

df_topics_cluster['Tweet'] = df_topics_cluster['Tweet'].apply(lambda x: split_text(x, 40))
df_topics_cluster['Breakdown'] = df_topics_cluster['Breakdown'].str.replace('\n','<br>')

df_topics_cluster.head(10)
Out[62]:
X Y Tweet Cluster Breakdown
0 -15.042162 -147.134689 You killed them, not the government.<br>Don't ... 4 Topic4: 0.98<br>Topic1: 0.01<br>Topic2: 0.01<b...
1 -130.925217 -44.245590 Poor youth. They are wasting their<br>bright f... 4 Topic4: 0.98
2 -260.619843 -178.951859 Why do we need to protect our children<br>agai... 4 Topic4: 0.99
3 -260.619843 -178.951859 They are the group that knows nothing<br>but f... 4 Topic4: 0.99
4 -214.089203 161.981995 That's nice. LFS, Anakbayan, Kabataan,<br>Gabr... 1 Topic1: 0.99
5 138.431992 -439.232208 There are also woke people from the US.<br>Com... 3 Topic3: 0.98
6 -130.925217 -44.245590 They are saying this for a long time<br>now. E... 4 Topic4: 0.98
7 144.970978 375.820251 Hahaha. Like LFS? Anakbayan? These are<br>CPP/... 5 Topic5: 0.99
8 317.921783 -370.787506 For the supporters of NPA legal fronts<br>like... 3 Topic3: 0.79<br>Topic5: 0.2
9 144.970978 375.820251 Miserable? 😂 CPP-NPA linked behind this<br>act... 5 Topic5: 0.99
In [64]:
# Plot tweets as colored points
df_topics_cluster.sort_values('Cluster', key=lambda x: pd.to_numeric(x, errors='coerce'), inplace=True)

fig = px.scatter(df_topics_cluster, x='X', y='Y', color='Cluster',
                 title='Topic Clustering using LDA and t-SNE with 5 topics',
                 hover_name='Tweet',
                 hover_data={'X':False, 'Y':False, 'Cluster':False, 'Tweet':False, 'Breakdown':True})

for i, keyword in enumerate(df_topics['Keywords']):
  fig.add_annotation(
    x=0,
    y=-0.2*(i/5)-0.08,
    text="Topic %d: %s"%(i+1, keyword.replace(' ', ', ')),
    showarrow=False,
    xref='paper',
    yref='paper',
    align='left',
    font=dict(color=fig.data[i].marker['color'])
  )

fig.update_layout(height=710,
                  xaxis_title='', yaxis_title='',
                  margin=dict(b=200),
                  width=700,
                  paper_bgcolor='#2c3e50',
                  title=dict(font=dict(color='white')),
                  legend=dict(title="Topic", font=dict(color='white')))
fig.show()
# chart_studio.plotly.iplot(fig, filename = 'modeling-5-topics', auto_open=True)

With five topics, we see an improvement over four: the topics have clearer distinguishing features, and the points within each cluster sit closer together. We can characterize the topics as follows:

  • Topic 1: This topic mentions the Student Christian Movement of the Philippines - UPLB (scmp_uplb) and the National Union of Students of the Philippines (nusphilippines), which means that many tweets talk about Anakbayan together with student organizations.
  • Topic 2: This topic shows that many tweets label Anakbayan and its activism as terrorists/terrorism.
  • Topic 3: This topic shows that many tweets also call Anakbayan communist or a front for communist groups.
  • Topic 4: This topic shows that many tweets talk about Anakbayan together with youth organizations or partylists. It also covers tweets that call Anakbayan a partylist (which it is not), perhaps conflating it with the Akbayan Partylist.
  • Topic 5: This topic shows that many tweets mention Anakbayan together with the Kabataan and Bayan Muna partylists, which belong to the Makabayan bloc.

Scattered across the topics is the keyword front, which appears in many tweets that call Anakbayan a front for the CPP-NPA.